Linear Models

George Savva (Quadram Institute Bioscience)

Introduction

Session outline

  • Understanding linear models
  • Rethinking statistics in terms of LMs
  • Examples, discussion and practical work with R

Getting our R packages

Start RStudio.

Make sure you have these packages installed:

install.packages(c("UsingR", "equatiomatic", "emmeans", "broom", 
                   "performance", "ggplot2", "gamlss", "car",
                   "patchwork", "ggbeeswarm", "readxl"))
library(UsingR) # dataset
library(readxl) # reading data from excel files
library(equatiomatic) # making equations from models
library(emmeans) # marginal means
library(broom) # tidy model output
library(performance) # model diagnostics
library(ggplot2) # making graphs
library(gamlss) # enhanced modelling
library(car)    # hypothesis tests 
library(ggbeeswarm) # beeswarm graphs
library(patchwork) # place plots side-by-side

Files for the session:

Data:

  • diversity.csv
  • survey.csv
  • observations.csv

Code, slides and handouts:

  • presentation.qmd
  • presentation.html
  • handout.html
  • linearmodels.pptx

What is a linear model?

  • What questions might we ask?
  • What form could the answer take?

What is a linear model?

\[\text{Height}=a+b\times \text{Age}\]

What is a linear model?

  • Describes a linear relationship between an outcome and its predictors.

  • What is the model for?

  • Who is interested and what will it be used for?

Fitting our model in R

model1 <- lm( height ~ age , data = kid.weights )
extract_eq(model1)

\[ \operatorname{height} = \alpha + \beta_{1}(\operatorname{age}) + \epsilon \]

tidy(model1)
# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   25.0     0.472        52.9 4.49e-137
2 age            0.241   0.00758      31.7 2.56e- 89
extract_eq(model1, use_coefs=TRUE)

\[ \operatorname{\widehat{height}} = 24.98 + 0.24(\operatorname{age}) \]

What can we do with this?

  • Describe the relationship?
  • Predict new values of height from age?
newages = data.frame(age=12*(1:20))

model1 |> predict(newdata = newages) |> cbind(newages)
   predict(model1, newdata = newages) age
1                            27.87171  12
2                            30.75998  24
3                            33.64825  36
4                            36.53652  48
5                            39.42478  60
6                            42.31305  72
7                            45.20132  84
8                            48.08959  96
9                            50.97786 108
10                           53.86613 120
11                           56.75440 132
12                           59.64267 144
13                           62.53094 156
14                           65.41921 168
15                           68.30748 180
16                           71.19574 192
17                           74.08401 204
18                           76.97228 216
19                           79.86055 228
20                           82.74882 240

Prediction intervals

model1 |> augment(interval="prediction") |>
  ggplot(aes(x=age,y=height))   + 
  theme_bw(base_size = 18)+ 
  geom_ribbon( aes(y=.fitted, ymin=.lower, ymax=.upper), alpha=0.2,fill="blue") + 
  geom_line( aes(y=.fitted)) + 
  geom_point() +
  labs(x="Age (months)",y="Height (ins)")

Confidence intervals

model1 |> augment(interval="confidence") |>
  ggplot(aes(x=age,y=height))   + 
  theme_bw(base_size = 18)+ 
  geom_ribbon( aes(y=.fitted, ymin=.lower, ymax=.upper), alpha=0.2,fill="blue") + 
  geom_line( aes(y=.fitted)) + geom_point() +
  labs(x="Age (months)",y="Height (ins)")

Our model is done:

extract_eq(model1)

\[ \operatorname{height} = \alpha + \beta_{1}(\operatorname{age}) + \epsilon \]

extract_eq(model1, use_coefs = TRUE)

\[ \operatorname{\widehat{height}} = 24.98 + 0.24(\operatorname{age}) \]

Standard deviation of ‘error’ = 4.77

\[\epsilon \sim N(0,4.77^2)\]

Summary 1

  • A linear model describes the linear relationship between an outcome and predictors

  • Can be used for prediction or inference

  • Simple linear models are fit in R using lm.

  • broom::augment makes it easy to get predictions.

  • Visualisations allow us to check model fit.

  • Confidence intervals reflect uncertainty around means

  • Prediction intervals reflect likely variation in new points
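The distinction in the last two bullets can be checked numerically with base predict(); this is a minimal sketch using simulated data rather than kid.weights:

```r
# Simulated sketch: confidence intervals (uncertainty about the mean) are
# narrower than prediction intervals (likely variation in new observations)
set.seed(42)
x <- 1:50
y <- 25 + 0.25 * x + rnorm(50, sd = 5)
m <- lm(y ~ x)

new  <- data.frame(x = 25)
conf <- predict(m, newdata = new, interval = "confidence")
pred <- predict(m, newdata = new, interval = "prediction")

conf_width <- conf[, "upr"] - conf[, "lwr"]
pred_width <- pred[, "upr"] - pred[, "lwr"]
pred_width > conf_width   # the prediction interval is always wider
```

Both intervals are centred on the same fitted value; only their widths differ.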

Example 2 - Weight by Age

Can we apply the same method to this dataset?

Can you foresee a problem?

ggplot(kid.weights, aes(x=age,y=weight))   + 
  theme_bw(base_size = 18)+ 
  geom_point() + labs(x="Age (months)",y="Weight (lbs)")

Example 2

What are the limitations of the ‘regular’ linear model?

model2 <- lm( weight ~ age , data = kid.weights )
model2 |> augment(interval="prediction") |>
  ggplot(aes(x=age,y=weight))+theme_bw(base_size = 18)+ 
  geom_ribbon(aes(y=.fitted, ymin=.lower, ymax=.upper),alpha=0.2,fill="blue") + 
  geom_line( aes(y=.fitted)) + geom_point() +
  labs(x="Age (months)",y="Weight (lbs)")

Enhancing our model 1

We can model how the error term changes with age.

model2b <- gamlss( weight ~ age , sigma.formula = ~age ,
                   data = kid.weights, control = gamlss.control(trace=F))
centiles.fan(model2b, cent=c(2.5,97.5))
points(kid.weights$age , kid.weights$weight)

Enhancing our model 1

summary(model2b)
******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = weight ~ age, sigma.formula = ~age,  
    data = kid.weights, control = gamlss.control(trace = F)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.9963     0.4337   32.27   <2e-16 ***
age           0.4910     0.0160   30.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.087779   0.067469   16.12   <2e-16 ***
age         0.017055   0.001047   16.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  250 
Degrees of Freedom for the fit:  4
      Residual Deg. of Freedom:  246 
                      at cycle:  3 
 
Global Deviance:     1662.226 
            AIC:     1670.226 
            SBC:     1684.312 
******************************************************************

Enhancing our model 2

We can also model how the skew changes with age!

model2c <- gamlss( weight ~ age , sigma.formula = ~age , nu.formula = ~age ,
            data = kid.weights , family="BCPE", control = gamlss.control(trace=F))
centiles.fan(model2c, cent=c(2.5,97.5))
points(kid.weights$age , kid.weights$weight)

Enhancing our model 2

summary(model2c)
******************************************************************
Family:  c("BCPE", "Box-Cox Power Exponential") 

Call:  gamlss(formula = weight ~ age, sigma.formula = ~age,  
    nu.formula = ~age, family = "BCPE", data = kid.weights,  
    control = gamlss.control(trace = F)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.41228    0.43597   30.77   <2e-16 ***
age          0.48394    0.01593   30.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.765940   0.065727 -26.868   <2e-16 ***
age          0.002114   0.001097   1.928    0.055 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Nu link function:  identity 
Nu Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.389413   0.431880  -0.902    0.368
age         -0.007446   0.006028  -1.235    0.218

------------------------------------------------------------------
Tau link function:  log 
Tau Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8196     0.1432   5.722 3.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  250 
Degrees of Freedom for the fit:  7
      Residual Deg. of Freedom:  243 
                      at cycle:  7 
 
Global Deviance:     1624.405 
            AIC:     1638.405 
            SBC:     1663.056 
******************************************************************

Summary 2

  • We can model things other than the mean!

  • In the old days (when I was training) this would have been very difficult

  • Now gamlss and Bayesian tools make it easy

  • How good does a model need to be?

Learning objectives for today

  • Distributions and Modelling
    • Estimating parameters with linear models
    • Re-framing all hypothesis tests as linear models
    • Continuous and discrete predictors
    • Interpreting and diagnosing models
  • Extensions and generalisations:
    • Multiple predictors / interactions (more complex hypotheses)
    • Log-transformations
    • Logistic regression for binary data

Aim

  • Start to understand the ‘modelling’ approach to statistics

  • Enable you to explore this extensive framework to do almost any kind of statistical / ML analysis

  • Get familiar with how to do this in R

  • Become a statistician?

About me!

  • Statistician at Quadram Institute Bioscience since 2017

  • Support for epidemiology / genomics / ecology / clinical trials / pre-clinical research

    • Some statistical thinking goes a long way
    • Many problems would be much simpler if they were re-framed in terms of linear models.
    • Other approaches make life difficult and lead to inefficiencies and mistakes
    • Statistics is a super-power!

About you!

  • Name, previous work, experience with stats software.

Inference, Models and Distributions

A simple motivating problem

  • Suppose we are interested in understanding the typical heights of UEA students.

  • How should we go about this?

  • Can you frame the question in mathematical / statistical terms?

    • What should the answer look like?
    • How will we know we are done?

Setting the question

How should you set the scope of a statistical problem?

Inference

  • What is the target population?

  • What is the sampling frame?

  • What is the sample?

A distribution and a model

  • Students’ heights are not all the same; they have a ‘distribution’.

  • What kind of data is a ‘height’?

  • There is an average (central tendency) and a variance (spread).

  • We also have to assume or determine the shape of the distribution.

  • As a baseline - assume continuous variables are independently normally distributed

Model

  • So, for example:

    “Heights are normally distributed with mean 180cm and standard deviation 10cm”

  • We can represent this visually

  • Is it likely to be good enough to describe the distribution of student heights?

Models

What is a model?

Writing this model mathematically

\[\text{Height} \sim N(\text{mean} , \text{variance})\]

\[H \sim N(\mu , \sigma^2)\]

We describe how individual heights arise like this:

\[h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)\]

The model has a structural part and an error part

This is the simplest linear model.
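This data-generating process can be sketched by simulation; the values 180 cm and 10 cm are just the illustrative parameters used above:

```r
# Simulate 1000 heights from h_i = mu + eps_i, with eps_i ~ N(0, sigma^2),
# using the illustrative parameters mu = 180 and sigma = 10
set.seed(1)
mu <- 180
sigma <- 10
h <- mu + rnorm(1000, mean = 0, sd = sigma)

# With this many draws, the sample statistics sit close to the true values
mean(h)
sd(h)
```

Running the simulation the other way round — recovering mu and sigma from a sample of h — is exactly what model fitting does.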

So what is a model?

Here’s our model for heights:

\[ \boxed{h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)} \]

  • A model is a mathematical or statistical way of writing down how we think our outcome is distributed

  • The model needs to be good enough for the purpose!

    • All models are wrong, but some are useful
  • It needs to link the aspects of the process we are interested in (true but unknown mean \(\mu\) and variance \(\sigma^2\)) to the data that we can observe (\(h_i\)).

    • Here it’s just a mean and variance we want to learn about but we can add things to this equation if we think they affect height.

Do I need a model to do analysis?

  • Yes! Whenever you’re doing any kind of statistics there is an underlying model.

  • An implicit understanding of how your data came to be

    • how the data is generated and
    • how it was collected.
  • Ad hoc statistical procedures sometimes obscure these assumptions

  • But with linear modelling they are more explicit.

‘Doing’ statistics

Doing statistics with the model

  • Our model:

\[h_i = \mu + \epsilon_i\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2)\]

  • If we estimate \(\mu\) and \(\sigma\), we have described the height distribution and we are done!

Key insight!

Once we have a model, all of statistical analysis is just estimating model parameters!

Estimating the model

  • We can use data to estimate the parameters

  • If we get enough \(h_i\)’s (at least 2!) we can estimate the values of \(\mu\) and \(\sigma\).

  • Several ways these models are estimated:

    • Numerical optimisation
    • Solving with algebra
    • Maximum likelihood
    • Bayesian resampling
  • (We won’t get into this; the computer will do it.)

Note

We can use this logic to show that the sample mean is the best estimate for \(\mu\).

Estimating the model using R

Enter some data and check it’s OK:

height = c(173,177,160,165,172,182,157,175,167)
dat <- data.frame(height)
head(dat)
  height
1    173
2    177
3    160
4    165
5    172
6    182

Estimate the model:

model1 <- lm( height ~ 1 , data=dat)

Display the model summary:

summary(model1)

Call:
lm(formula = height ~ 1, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.778  -4.778   2.222   5.222  12.222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  169.778      2.722   62.37 4.86e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.167 on 8 degrees of freedom

Uncertainty

  • Is our parameter estimate perfect? Have we completely discovered the average height of UEA students?

  • The confidence interval and standard error reflect the uncertainty in the estimate.

  • (There is also uncertainty in the standard deviation estimate, but this is harder to calculate using R and we don’t often use it.)

Thinking about uncertainty

What can we really say about the mean and spread of student heights?

Is it plausible that the true mean height is 170cm?

linearHypothesis(model1, "(Intercept)=170")
Linear hypothesis test

Hypothesis:
(Intercept) = 170

Model 1: restricted model
Model 2: height ~ 1

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      9 534.00                           
2      8 533.56  1   0.44444 0.0067 0.9369

Try some other values. 160cm, 163cm?

confint(model1)
               2.5 %   97.5 %
(Intercept) 163.5003 176.0552
broom::tidy(model1 , conf.int=TRUE)
# A tibble: 1 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     170.      2.72      62.4 4.86e-12     164.      176.
  • The confidence interval gives us every plausible value of the parameter given the data.

  • That is, every value not rejected at p<0.05 (for a 95% interval)
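This duality can also be demonstrated without car, using the t-distribution directly — a minimal base-R sketch with the nine heights entered earlier:

```r
# Re-enter the nine heights and fit the intercept-only model
height <- c(173, 177, 160, 165, 172, 182, 157, 175, 167)
m <- lm(height ~ 1)

est <- coef(m)[1]
se  <- sqrt(vcov(m)[1, 1])

# Two-sided p-value for the hypothesis "true mean = mu0"
pval <- function(mu0)
  2 * pt(abs((est - mu0) / se), df = df.residual(m), lower.tail = FALSE)

pval(170)                 # comfortably inside the interval: a large p-value
pval(confint(m)[1, 2])    # exactly at the 95% upper limit: p = 0.05
</imports>
```

A hypothesised value sitting exactly on the interval boundary gives p = 0.05 by construction, which is the duality in action.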

Diagnostics

  • How good is our model? What were the model assumptions?

  • Heights are normally distributed

  • Individuals’ heights are independent of each other

  • (Our sample is representative)

  • How can we check normality?

  • A qqplot compares model residuals with a normal distribution:

plot(model1 , which=2)
  • How might we check assumptions of independence and representativeness?

Summary 3

  • Statistical inference is trying to estimate something fixed but unknown in a population using observed data in a sample

  • The model links the unknown parameters to the observed data

  • Allows us to do prediction and inference

  • There is a duality between confidence intervals and p-values

  • We can check normality using qqplots. A visual inspection like this is better than statistical tests for normality that you might see.

Adelaide data

Exercise

A student questionnaire was conducted at Adelaide University. We’ll use this dataset to answer some interesting questions.

  1. Load data into R (from "survey.xlsx")

  2. Estimate a model in R to find the distribution of student heights.

  3. Check whether the heights are normally distributed

  4. Check whether using the mean and sd functions in R gives you the same answer.

Extending our linear model

Adding a predictor

  • Let’s improve our model.

  • Remind ourselves of the purpose of a model

    • We might have a specific hypothesis to test

    • Or we might want to predict an outcome for a new individual

    • Or just understand the factors that contribute to our outcome in a population

  • Can we improve our height prediction?

  • Can we add a factor to our model that helps to explain height?

Different average for men and women.

We might write this as:

\[ h_i=\left\{\begin{array}{ll} \mu_\text{male} + \epsilon_i & \text{(if male)}\\ \mu_\text{female} + \epsilon_i & \text{(if female)} \end{array}\right.\quad\text{where}\quad\epsilon_i \sim N(0,\sigma^2) \]

But we would usually write this as:

\[ \boxed{h_i = \mu + \beta x_i+\epsilon_i \quad\text{where}\quad \left\{\begin{array}{ll} x_i=1 &\text{if male;}\\ x_i=0 & \text{if female} \end{array}\right.} \]

Can you interpret this model?

What are the mean heights for men and women under this model?

What does \(\beta\) represent?
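A quick way to see the 0/1 coding in action is model.matrix(), shown here on a hypothetical toy variable (not the survey data):

```r
# model.matrix() shows the x_i column that R builds for a two-level factor;
# the first level alphabetically ("female") is the reference, coded 0
toy <- data.frame(sex = c("female", "male", "male", "female"))
model.matrix(~ sex, data = toy)
```

The intercept column is all 1s and the sexmale column is the \(x_i\) of the boxed equation above.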

Estimating our new model using R

dat <- read_excel("survey.xlsx", na="NA") |> remove_missing()
table(dat$Sex)

Female   Male 
    84     84 
head(dat)
# A tibble: 6 x 13
   ...1 Sex    Wr.Hnd NW.Hnd W.Hnd Fold    Pulse Clap  Exer  Smoke Height M.I   
  <dbl> <chr>   <dbl>  <dbl> <chr> <chr>   <dbl> <chr> <chr> <chr>  <dbl> <chr> 
1     1 Female   18.5   18   Right R on L     92 Left  Some  Never   173  Metric
2     2 Male     19.5   20.5 Left  R on L    104 Left  None  Regul   178. Imper~
3     5 Male     20     20   Right Neither    35 Right Some  Never   165  Metric
4     6 Female   18     17.7 Right L on R     64 Right Some  Never   173. Imper~
5     7 Male     17.7   17.7 Right L on R     83 Right Freq  Never   183. Imper~
6     8 Female   17     17.3 Right R on L     74 Right Freq  Never   157  Metric
# ... with 1 more variable: Age <dbl>

Plotting

A graph is a good way to start modelling!

ggplot(dat , aes(x=Sex, y=Height)) + 
  geom_boxplot() + geom_beeswarm() + theme_bw()

Estimating with R

model2 <- lm( Height ~ Sex , data=dat)
extract_eq(model2)

\[ \operatorname{Height} = \alpha + \beta_{1}(\operatorname{Sex}_{\operatorname{Male}}) + \epsilon \]

extract_eq(model2 , use_coefs=TRUE)

\[ \operatorname{\widehat{Height}} = 165.6 + 13.75(\operatorname{Sex}_{\operatorname{Male}}) \]

Diagnostic plots

How well does our model fit its assumptions?

par(mfrow=c(2,2))
plot(model2)

Interpretation

  • Interpreting coefficients and confidence intervals
tidy(model2, conf.int=TRUE)
# A tibble: 2 x 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)    166.      0.787     211.  1.80e-203    164.      167. 
2 SexMale         13.7     1.11       12.4 2.71e- 25     11.6      15.9
extract_eq(model2, use_coefs = TRUE)

\[ \operatorname{\widehat{Height}} = 165.6 + 13.75(\operatorname{Sex}_{\operatorname{Male}}) \]

  • Marginal means
emmeans(model2 , ~Sex)
 Sex    emmean    SE  df lower.CL upper.CL
 Female    166 0.787 166      164      167
 Male      179 0.787 166      178      181

Confidence level used: 0.95 
  • Comparing models
model1 <- lm( Height ~ 1 , data=dat)
anova(model1, model2)
Analysis of Variance Table

Model 1: Height ~ 1
Model 2: Height ~ Sex
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    167 16565.1                                  
2    166  8627.5  1    7937.6 152.72 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary 4

  • A linear model can be used to replace a t-test

  • (Comparison of a single outcome across two independent groups)

  • Why should we prefer this to a t-test?
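One answer to the last question: nothing is lost, because the two approaches agree exactly in the equal-variance case. A sketch on simulated data (not the survey dataset):

```r
# lm() and an equal-variance t-test give identical answers for two groups
set.seed(7)
g <- rep(c("A", "B"), each = 30)
y <- ifelse(g == "B", 5, 0) + rnorm(60, sd = 4)

m  <- lm(y ~ g)
tt <- t.test(y ~ g, var.equal = TRUE)

# Same p-value; the "gB" coefficient is the difference in group means
all.equal(summary(m)$coefficients["gB", "Pr(>|t|)"], tt$p.value)
all.equal(unname(coef(m)["gB"]), mean(y[g == "B"]) - mean(y[g == "A"]))
```

Unlike t.test(), though, the lm() version extends directly to extra predictors and interactions.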

A multiple regression

Let’s repeat this process to understand how hand span depends on height.

ggplot(data=dat, aes(Height, Wr.Hnd, shape=Sex)) + geom_point() 

Can you write down a model for how hand span depends on height? Can you estimate it using R?

model3 <- lm(data=dat, Wr.Hnd ~ Height)
summary(model3)

Call:
lm(formula = Wr.Hnd ~ Height, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8644 -0.8644  0.0230  0.8140  4.8044 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.83544    1.96131  -1.446     0.15    
Height       0.12545    0.01135  11.051   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.461 on 166 degrees of freedom
Multiple R-squared:  0.4238,    Adjusted R-squared:  0.4204 
F-statistic: 122.1 on 1 and 166 DF,  p-value: < 2.2e-16

Interpret and diagnose (test model assumptions) this model

Multiple predictors

Can we better explain the hand width model by adding a second predictor?

model4 <- lm(data=dat, Wr.Hnd ~ Height + Sex)
summary(model4)

Call:
lm(formula = Wr.Hnd ~ Height + Sex, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5715 -0.8668  0.0558  0.9587  3.9863 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.90837    2.49397   1.567    0.119    
Height       0.08281    0.01503   5.509 1.36e-07 ***
SexMale      1.22353    0.29853   4.099 6.51e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.396 on 165 degrees of freedom
Multiple R-squared:  0.4771,    Adjusted R-squared:  0.4707 
F-statistic: 75.27 on 2 and 165 DF,  p-value: < 2.2e-16

Compare the two models. What do they tell you? Which is fitting the data better?

extract_eq(model3, use_coefs = TRUE)

\[ \operatorname{\widehat{Wr.Hnd}} = -2.84 + 0.13(\operatorname{Height}) \]

extract_eq(model4, use_coefs = TRUE)

\[ \operatorname{\widehat{Wr.Hnd}} = 3.91 + 0.08(\operatorname{Height}) + 1.22(\operatorname{Sex}_{\operatorname{Male}}) \]

sigma(model3)
[1] 1.461155
sigma(model4)
[1] 1.396227

Is the new model significantly better?

anova(model4, model3)
Analysis of Variance Table

Model 1: Wr.Hnd ~ Height + Sex
Model 2: Wr.Hnd ~ Height
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    165 321.66                                  
2    166 354.41 -1   -32.747 16.798 6.508e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictions

(ggplot(data=augment(model3), aes(Height, Wr.Hnd)) + geom_point() + geom_line(aes(y=.fitted), lwd=1))+
(ggplot(data=augment(model4), aes(Height, Wr.Hnd, color=Sex)) + geom_point() + geom_line(aes(y=.fitted), lwd=1) +scale_color_manual(values=c("black", "red")))

Interactions

Finally, we might ask if there’s any evidence that the slope is different between men and women.

How could we make a model to test this hypothesis?

model5 <- lm(data=dat, Wr.Hnd ~ Height + Sex + Height:Sex)
model5 <- lm(data=dat, Wr.Hnd ~ Height * Sex)
extract_eq(model5)

\[ \operatorname{Wr.Hnd} = \alpha + \beta_{1}(\operatorname{Height}) + \beta_{2}(\operatorname{Sex}_{\operatorname{Male}}) + \beta_{3}(\operatorname{Height} \times \operatorname{Sex}_{\operatorname{Male}}) + \epsilon \]

summary(model5)

Call:
lm(formula = Wr.Hnd ~ Height * Sex, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5776 -0.8609  0.0488  0.9659  4.0437 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)     5.58874    4.37585   1.277   0.2033   
Height          0.07266    0.02641   2.751   0.0066 **
SexMale        -1.33531    5.47715  -0.244   0.8077   
Height:SexMale  0.01505    0.03216   0.468   0.6405   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.4 on 164 degrees of freedom
Multiple R-squared:  0.4778,    Adjusted R-squared:  0.4682 
F-statistic: 50.01 on 3 and 164 DF,  p-value: < 2.2e-16
extract_eq(model4, use_coefs = TRUE)

\[ \operatorname{\widehat{Wr.Hnd}} = 3.91 + 0.08(\operatorname{Height}) + 1.22(\operatorname{Sex}_{\operatorname{Male}}) \]

extract_eq(model5, use_coefs = TRUE, wrap=TRUE, terms_per_line = 2)

\[ \begin{aligned} \operatorname{\widehat{Wr.Hnd}} &= 5.59 + 0.07(\operatorname{Height})\ - \\ &\quad 1.34(\operatorname{Sex}_{\operatorname{Male}}) + 0.02(\operatorname{Height} \times \operatorname{Sex}_{\operatorname{Male}}) \end{aligned} \]

(ggplot(data=augment(model3), aes(Height, Wr.Hnd)) + geom_point() + geom_line(aes(y=.fitted), lwd=1))+
(ggplot(data=augment(model4), aes(Height, Wr.Hnd, color=Sex)) + geom_point() + geom_line(aes(y=.fitted), lwd=1) +scale_color_manual(values=c("black", "red"))) + 
(ggplot(data=augment(model5), aes(Height, Wr.Hnd, color=Sex)) + geom_point() + geom_line(aes(y=.fitted), lwd=1) +scale_color_manual(values=c("black", "red"))) +
  plot_layout(guides="collect")

Model comparison

ANOVA compares (1) the improvement in fit between models to (2) the improvement that would be expected if there were no real difference in the population.

The p-values reflect the statistical significance of the improvement as we increase the model complexity.

anova(model3, model4, model5)
Analysis of Variance Table

Model 1: Wr.Hnd ~ Height
Model 2: Wr.Hnd ~ Height + Sex
Model 3: Wr.Hnd ~ Height * Sex
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
1    166 354.41                                   
2    165 321.66  1    32.747 16.7184 6.777e-05 ***
3    164 321.23  1     0.429  0.2189    0.6405    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is the interaction term needed?

Summary

  • We can add multiple predictors to improve a model
  • Linear models can replace ANCOVA
  • Interactions tell us whether effects are different across sub-groups
  • Models can be compared using p-values to test whether additional terms are needed

Extensions

Everything is a linear model

What statistical tests/procedures have we replaced with a linear model?

  • Unpaired T-test
  • Correlation
  • ANOVA
  • ANCOVA

All other statistical tests can be reframed in this way:

  • Models for count or categorical outcomes
  • Paired data
  • Non-parametric tests
  • More complex machine learning
  • Non-linear models are usually straightforward extensions

https://lindeloev.github.io/tests-as-linear/#9_teaching_materials_and_a_course_outline
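As one example from that list, a paired t-test is just an intercept-only linear model on the within-pair differences — a sketch with simulated before/after measurements, not one of the session datasets:

```r
# Paired data: model the differences with an intercept-only lm()
set.seed(3)
before <- rnorm(20, mean = 50, sd = 8)
after  <- before + 2 + rnorm(20, sd = 3)

d  <- after - before
m  <- lm(d ~ 1)
tt <- t.test(after, before, paired = TRUE)

# Intercept = mean difference, and the p-values match exactly
all.equal(unname(coef(m)[1]), unname(tt$estimate))
all.equal(summary(m)$coefficients[1, "Pr(>|t|)"], tt$p.value)
```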

Exercise

Look at the diversity data in R.

diversityDay <- read.csv("diversity.csv")

This is data from a real experiment to explore gut microbial diversity over time in hospital patients.

We’ll use linear models to test whether diversity changes over time.

  1. What is the mean and standard deviation of diversity across all samples?
  2. Plot how microbial diversity varies with time.
  3. Is microbial diversity correlated with time?
  4. Is the correlation statistically significant?
  5. What happens if we adjust for patient (add patient to the model)?

Dealing with non-normal data

  • So far we’ve seen data that can be modelled using normal distributions, where the relationships are genuinely straight lines.

  • Biological data very often does not act like this

  • What options do we have?

  • Funkier models?

  • Transformations?

Animal traits

The body mass of animals is linked to their metabolic rate.

animals <- read.csv("observations.csv")
ggplot(animals, aes(body.mass,metabolic.rate,col=phylum)) + 
  geom_point() + facet_wrap(~phylum, scale="free")

Can we fit a linear model to this?

Logarithmic transformation

The same data plotted on a different scale:

ggplot(animals, aes(body.mass,metabolic.rate,col=phylum)) + 
  geom_point() + scale_x_log10() + scale_y_log10()

Logarithmic transformations

Instead of modelling how \(y\) depends on \(x\), we can model how \(\log(y)\) depends on \(\log(x)\).

  • This is a very commonly used approach and it is surprising how often and how well it works.

\[\log(y_i) = a + b\log(x_i) + \epsilon_i\]

  • Implies

\[y_i = \exp(a + b\log(x_i) + \epsilon_i) = e^a\times x_i^{b}\times e^{\epsilon_i}\]

  • So instead of an additive model, we have a multiplicative one.

  • But we estimate using R as a regular linear model, using \(\log(y)\) as the outcome and \(\log(x)\) as the predictor.
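A quick numeric check of the multiplicative claim, using illustrative values of \(a\) and \(b\) (not fitted coefficients):

```r
# On the log scale the model is additive; exponentiating shows it is
# multiplicative on the original scale: y = e^a * x^b (illustrative a, b)
a <- -0.5
b <- 0.85
x <- 100

all.equal(exp(a + b * log(x)), exp(a) * x^b)

# Doubling x multiplies the predicted y by 2^b, whatever the value of x
all.equal(exp(a + b * log(2 * x)) / exp(a + b * log(x)), 2^b)
```

So on the log-log scale, \(b\) tells us the proportional change in \(y\) per proportional change in \(x\).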

modelAnimals <- lm(data=animals , log(metabolic.rate) ~ log(body.mass)*phylum)
summary(modelAnimals)

Call:
lm(formula = log(metabolic.rate) ~ log(body.mass) * phylum, data = animals)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8236 -0.4098  0.0867  0.5732  2.4220 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -0.55146    0.15437  -3.572 0.000368 ***
log(body.mass)                 0.86224    0.01496  57.641  < 2e-16 ***
phylumChordata                 1.45940    0.16753   8.711  < 2e-16 ***
log(body.mass):phylumChordata -0.25689    0.02493 -10.305  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9566 on 1180 degrees of freedom
  (2394 observations deleted due to missingness)
Multiple R-squared:  0.9588,    Adjusted R-squared:  0.9587 
F-statistic:  9161 on 3 and 1180 DF,  p-value: < 2.2e-16
augment(modelAnimals, interval="prediction") |>
  ggplot(aes(x=`log(body.mass)`, y=`log(metabolic.rate)`, col=phylum, fill=phylum)) +  
  geom_point() + 
  geom_line(aes(y=.fitted)) +
  geom_ribbon(aes(y=.fitted,ymin=.lower,ymax=.upper),alpha=0.2) + 
  facet_wrap(~phylum, scale="free")
animals$predicted <- predict(modelAnimals, newdata = animals, type="response")

augment(modelAnimals, interval="prediction") |>
  ggplot(aes(x=exp(`log(body.mass)`), y=exp(`log(metabolic.rate)`), col=phylum, fill=phylum)) +  
  geom_point() + 
  geom_line(aes(y=exp(.fitted))) +
  geom_ribbon(aes(y=exp(.fitted),ymin=exp(.lower),ymax=exp(.upper)),alpha=0.2) + 
  facet_wrap(~phylum, scale="free")

Summary 5

  • Linear models can be useful for data that does not look linear!

  • Next time:

  • Binary data (logistic regression)

  • Multilevel models (more complex errors)